

Search for: All records

Creators/Authors contains: "Zheng, Pengfei"


  1. Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers such as Slurm. Given fixed wall-clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower research productivity and the QoS of services deployed in production. To mitigate these interruptions, we propose the design of a proactive provisioner and investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, XGBoost, Deep Q-Network, and policy gradient. Using production job traces from three GPU clusters, we train each model on a subset of each trace and then evaluate its generality on the remaining validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate ML methods. Our experiments show that Mirage can reduce interruptions by 17–100% and safeguard 23–76% of jobs with zero interruption across varying load levels on the three clusters.
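The abstract above describes training statistical-learning candidates (e.g., random forest, XGBoost) on production job traces and validating them on a held-out subset. The following is a minimal illustrative sketch of that train/validate methodology only; the feature names, the synthetic trace, and the interruption label are assumptions for illustration and are not taken from the paper or from Mirage itself.

    # Hypothetical sketch: fit a random-forest model on per-job trace features
    # to predict whether a queued job will be interrupted. All features and the
    # synthetic trace below are illustrative assumptions, not the paper's data.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    n_jobs = 5_000

    # Synthetic stand-ins for per-job features (hypothetical):
    # requested GPUs, requested wall time (hours), cluster load at submission.
    X = np.column_stack([
        rng.integers(1, 9, n_jobs),        # gpus_requested
        rng.uniform(0.5, 48.0, n_jobs),    # walltime_hours
        rng.uniform(0.0, 1.0, n_jobs),     # cluster_load
    ])
    # Synthetic label: long jobs submitted to heavily loaded clusters are
    # assumed more likely to be interrupted.
    p_interrupt = 1.0 / (1.0 + np.exp(-(3.0 * X[:, 2] + 0.05 * X[:, 1] - 2.5)))
    y = rng.random(n_jobs) < p_interrupt

    # Train on one subset of the trace and check generality on the held-out
    # remainder, mirroring the train/validation split described in the abstract.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)
    print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

A real provisioner would replace the synthetic arrays with features extracted from Slurm accounting logs and would feed the model's predictions into a scheduling policy; those details are specific to the paper and are not reproduced here.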